Study of ethnic residential segregation and health outcomes with clustering analysis in Bradford, UK

Purpose of this file is to explore UK ethnic census data.

It includes data extraction, data pre-processing and statistical analysis

Metadata :

Two data source was used to explore about ethnicity.

  1. 3 years UK census data (Ethnic groups) – 2001, 2011 and 2021
  2. Census Lookup data – 2001, 2011 and 2021

The UK census data : The UK census is a decennial survey conducted by the Office for National Statistics (ONS) to gather information about the population of the United Kingdom. 2021 census data is collected based on various categories such as demography and migration, UK armed forces veterans, ethnicity, identity, language and religion, work, travel, housing, health and education. This coding page uses ethnicity years census data was used namely 2001, 2011 and 2021.

The UK census lookup data: This data was used to find the relevant geographical unit from the UK census data. They specifically provide attributes such as LSOA, MSOA, OA and LAD geographical code and their geographical names. The lookup data is mainly associated with geographical information systems (GIS) to analyze and visualize geospatial data. The UK census lookup data is also released along with the UK census data every ten years. Using UK census data’s LSOA codes the geographical location of the specific code is extracted from the lookup data for further analysis of segregation and applying machine learning algorithm.

Link to datasource: https://www.ons.gov.uk/datasets/TS021/editions/2021/versions/2/filter-outputs/84b48c92-aced-4b75-a761-4213b8c6bb94#get-data

In [156]:
# Import required packages:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.offline.init_notebook_mode()
from plotly.offline import iplot
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

1. 2021 census data - UK¶

In [157]:
# Load 2021 UK census and LAD dataset

df_2021 = pd.read_csv("Dataset/Ethnic_census_data/2021_data/TS021-2021-2-filtered-2023-03-19T11-48-56Z.csv")
df_lad_2021 = pd.read_csv("Dataset/Ethnic_census_data/2021_data/OA21_LSOA21_MSOA21_LAD22_EW_LU.csv", encoding='latin-1')
In [158]:
#Check the shape of the dataframe

print(df_2021.shape)
print(df_lad_2021.shape)
(3777600, 5)
(188880, 8)
In [159]:
#Visualize the UK 2021 census data 

df_2021.head(2)
Out[159]:
Output Areas Code Output Areas Ethnic group (20 categories) Code Ethnic group (20 categories) Observation
0 E00000001 E00000001 -8 Does not apply 0
1 E00000001 E00000001 1 Asian, Asian British or Asian Welsh: Bangladeshi 0
In [160]:
#Visualizing the UK 2021 LAD data 

df_lad_2021.head(2)
Out[160]:
oa21cd lsoa21cd lsoa21nm msoa21cd msoa21nm lad22cd lad22nm lad22nmw
0 E00000001 E01000001 City of London 001A E02000001 City of London 001 E09000001 City of London NaN
1 E00000003 E01000001 City of London 001A E02000001 City of London 001 E09000001 City of London NaN
In [161]:
# Renaming column names in  data

df_2021 = df_2021.rename(columns={'Ethnic group (20 categories) Code': 'Ethnic_group_code', 'Ethnic group (20 categories)' : 'Ethnic_group'})

# Remove special character in the OA data columns

df_2021.columns = df_2021.columns.str.replace(' ', '_')
In [162]:
# counting unique values in 2021 census  data

n = len(pd.unique(df_lad_2021["oa21cd"]))
print("No.of.unique values :",n)
No.of.unique values : 188880
In [163]:
#Transposing the column Ethnic group (20 categories) in census data

df_2021 = df_2021.pivot(index='Output_Areas',columns='Ethnic_group', values='Observation')
df_2021.head(1)
Out[163]:
Ethnic_group Asian, Asian British or Asian Welsh: Bangladeshi Asian, Asian British or Asian Welsh: Chinese Asian, Asian British or Asian Welsh: Indian Asian, Asian British or Asian Welsh: Other Asian Asian, Asian British or Asian Welsh: Pakistani Black, Black British, Black Welsh, Caribbean or African: African Black, Black British, Black Welsh, Caribbean or African: Caribbean Black, Black British, Black Welsh, Caribbean or African: Other Black Does not apply Mixed or Multiple ethnic groups: Other Mixed or Multiple ethnic groups Mixed or Multiple ethnic groups: White and Asian Mixed or Multiple ethnic groups: White and Black African Mixed or Multiple ethnic groups: White and Black Caribbean Other ethnic group: Any other ethnic group Other ethnic group: Arab White: English, Welsh, Scottish, Northern Irish or British White: Gypsy or Irish Traveller White: Irish White: Other White White: Roma
Output_Areas
E00000001 0 4 0 0 1 0 3 0 0 1 5 0 0 7 0 112 0 0 39 4
In [164]:
#Reset index for OA dataframe

df_2021 = df_2021.reset_index()

Combining 21 ethnic groups to 4

In 2021 census data there are 21 ethnic groups. Roma was one of new ethnic group added this year. To reduce the number of ethnicity, a person can be identified through one of the following four high-level ethnic groups as per www.ons.gov.uk:

  • "Asian, Asian British, Asian Welsh"
  • "Black, Black British, Black Welsh, Caribbean or African"
  • "White"
  • "Mixed or Multiple", "Other ethnic group"
In [165]:
# Checking data type for all the variables in OA data

df_2021.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188880 entries, 0 to 188879
Data columns (total 21 columns):
 #   Column                                                                  Non-Null Count   Dtype 
---  ------                                                                  --------------   ----- 
 0   Output_Areas                                                            188880 non-null  object
 1   Asian, Asian British or Asian Welsh: Bangladeshi                        188880 non-null  int64 
 2   Asian, Asian British or Asian Welsh: Chinese                            188880 non-null  int64 
 3   Asian, Asian British or Asian Welsh: Indian                             188880 non-null  int64 
 4   Asian, Asian British or Asian Welsh: Other Asian                        188880 non-null  int64 
 5   Asian, Asian British or Asian Welsh: Pakistani                          188880 non-null  int64 
 6   Black, Black British, Black Welsh, Caribbean or African: African        188880 non-null  int64 
 7   Black, Black British, Black Welsh, Caribbean or African: Caribbean      188880 non-null  int64 
 8   Black, Black British, Black Welsh, Caribbean or African: Other Black    188880 non-null  int64 
 9   Does not apply                                                          188880 non-null  int64 
 10  Mixed or Multiple ethnic groups: Other Mixed or Multiple ethnic groups  188880 non-null  int64 
 11  Mixed or Multiple ethnic groups: White and Asian                        188880 non-null  int64 
 12  Mixed or Multiple ethnic groups: White and Black African                188880 non-null  int64 
 13  Mixed or Multiple ethnic groups: White and Black Caribbean              188880 non-null  int64 
 14  Other ethnic group: Any other ethnic group                              188880 non-null  int64 
 15  Other ethnic group: Arab                                                188880 non-null  int64 
 16  White: English, Welsh, Scottish, Northern Irish or British              188880 non-null  int64 
 17  White: Gypsy or Irish Traveller                                         188880 non-null  int64 
 18  White: Irish                                                            188880 non-null  int64 
 19  White: Other White                                                      188880 non-null  int64 
 20  White: Roma                                                             188880 non-null  int64 
dtypes: int64(20), object(1)
memory usage: 30.3+ MB

Data type is checked to see if all ethnic group fields is an integer for further computing.

In [166]:
# Computing 4 ethnic groups : Select required columns to sum using DataFrame to create new variables with 4 only ethinic group for further analysis

df_2021['Asian']=df_2021.iloc[:,1:6].sum(axis=1)
df_2021['Black']=df_2021.iloc[:,6:9].sum(axis=1)
df_2021['White']=df_2021.iloc[:,16:21].sum(axis=1)
df_2021['Others']=df_2021.iloc[:,9:16].sum(axis=1)
In [167]:
# Dropping other features and including only required features in OA dataset

df_2021.drop(df_2021.iloc[:,1:-4], inplace=True, axis=1)
In [168]:
# Checking the null values in Census 2021 data

df_2021.isnull().sum()
Out[168]:
Ethnic_group
Output_Areas    0
Asian           0
Black           0
White           0
Others          0
dtype: int64

No null values were identifed.

In [169]:
#Statistical analysis

df_2021.describe()
Out[169]:
Ethnic_group Asian Black White Others
count 188880.000000 188880.000000 188880.000000 188880.000000
mean 29.203155 12.754024 257.833095 15.745775
std 57.525246 26.608705 91.101243 18.159850
min 0.000000 0.000000 0.000000 0.000000
25% 2.000000 0.000000 209.000000 4.000000
50% 9.000000 3.000000 261.000000 10.000000
75% 28.000000 12.000000 304.000000 21.000000
max 986.000000 902.000000 2978.000000 435.000000
Insight: Statistical analysis shows that White ethnic group are larger in number, second highest is the Asians.
In [170]:
# Renaming LDA columns for joining the data with census data

df_lad_2021.rename(columns={"oa21cd": "Output_Areas", "lad22nm" : "LAD", "lsoa21cd":"LSOA"}, inplace=True)

#Dropping the other columns

df_lad_to_join = df_lad_2021[['Output_Areas', 'LAD', "LSOA"]].copy()
df_lad_to_join.head(2)
Out[170]:
Output_Areas LAD LSOA
0 E00000001 City of London E01000001
1 E00000003 City of London E01000001
In [171]:
# Join OA census 2021 and LAD dataframe

join_data = pd.merge(left = df_2021, right = df_lad_to_join, how = 'left', on = 'Output_Areas' )
join_data.head(1)
Out[171]:
Output_Areas Asian Black White Others LAD LSOA
0 E00000001 5 3 155 13 City of London E01000001
In [172]:
# Re-arranging the order of the column in joined dataset

join_data = join_data.reindex(['LSOA','LAD','Output_Areas','Asian', 'Black', 'White','Others'], axis=1)
join_data.head(2)
Out[172]:
LSOA LAD Output_Areas Asian Black White Others
0 E01000001 City of London E00000001 5 3 155 13
1 E01000001 City of London E00000003 24 5 209 17
In [173]:
# counting unique values in join_data

n = len(pd.unique(join_data["LSOA"]))
print("No.of.unique values :",n)

# counting unique values in join_data

n = len(pd.unique(join_data["LAD"]))
print("No.of.unique values :",n)
No.of.unique values : 35672
No.of.unique values : 331
Insight: Unique values shows that there are 35672 LSOA and 331 Local Authourity Districts in the UK.
In [174]:
#Saving joined 2021 census dataset:

join_data.to_csv("Dataset/Processed_Data/processed_ethnic_data_2021.csv")
In [175]:
#Loading the joined census dataset

df_census_2021 = pd.read_csv("Dataset/Processed_Data/processed_ethnic_data_2021.csv", index_col=0)
In [176]:
# Use groupby() to compute the sum
df2 =df_census_2021.groupby(['LAD']).sum()
df2 = df2.reset_index()
df2.head(2)
Out[176]:
LAD Asian Black White Others
0 Adur 1388 536 60284 2329
1 Allerdale 541 85 94671 853
In [177]:
#Description of Datasets

df2.describe(include='all')
Out[177]:
LAD Asian Black White Others
count 331 331.000000 331.000000 331.000000 331.000000
unique 331 NaN NaN NaN NaN
top Adur NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN
mean NaN 16664.326284 7277.885196 147128.444109 8985.081571
std NaN 32901.122980 15359.065822 87905.046701 12421.411368
min NaN 6.000000 4.000000 2004.000000 40.000000
25% NaN 1874.000000 583.500000 94197.500000 2256.500000
50% NaN 4928.000000 1397.000000 122759.000000 4116.000000
75% NaN 14998.000000 5225.500000 165033.000000 8891.000000
max NaN 355386.000000 125741.000000 641847.000000 107194.000000
In [178]:
dfplot = df2.sort_values(by="Asian",ascending=False).head(20)

#Using a horizontal bar chat : Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Asian')

#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='White')

#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation percentage by Ethnic Groups - Asian and White')
plt.xlabel('Poplulation')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)

#Displaying the plot
plt.show()
In [179]:
dfplot = df2.sort_values(by="Black",ascending=False).head(20)

#Using a horizontal bar chat : Ploting the popluation of Others
plt.barh(dfplot.LAD, dfplot.Asian, label='Black')

#Ploting the popluation of White
plt.barh(dfplot.LAD, dfplot.Black, label='Others')

#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation percentage by Ethnic Groups - Black and Others')
plt.xlabel('Poplulation')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)

#Displaying the plot
plt.show()
Insight: From the barplot which compares Asians and White, it can be seen Asians reside more in numbers in LAD than white. Similary, ethinic group falling under Others are more than black.
In [180]:
#Total popluation in the UK - creatin this for visualization purpose

Total_UK_2021_asian = df2['Asian'].sum()
Total_UK_2021_black = df2['Black'].sum()
Total_UK_2021_white = df2['White'].sum()
Total_UK_2021_others = df2['Others'].sum()

1.1 2021 census data - Bradford¶

In [181]:
# Selecting only Bradford from the joined dataset

data_bradford = df_census_2021[(df_census_2021.LAD == "Bradford")]
In [182]:
data_bradford.head(2)
Out[182]:
LSOA LAD Output_Areas Asian Black White Others
50292 E01010573 Bradford E00053353 10 0 422 3
50293 E01010573 Bradford E00053354 3 1 317 6
In [183]:
#checking the size of census data

data_bradford.shape
Out[183]:
(1575, 7)
In [184]:
# counting unique values in join_data

n = len(pd.unique(data_bradford["LSOA"]))
print("No.of.unique values :",n)
No.of.unique values : 312
Insight: There are 312 LSOAs in Bradford
In [185]:
#statistical analysis of census data

data_bradford.describe()
Out[185]:
Asian Black White Others
count 1575.000000 1575.000000 1575.000000 1575.000000
mean 111.524444 6.987302 212.053968 16.365079
std 142.526573 10.622246 108.529429 13.891371
min 0.000000 0.000000 0.000000 0.000000
25% 8.000000 0.000000 126.000000 6.000000
50% 35.000000 3.000000 232.000000 12.000000
75% 185.000000 9.000000 283.000000 23.000000
max 586.000000 102.000000 582.000000 83.000000
Insight: In Bradford, the average of white ethnic group is doubled the mean of Asians.
In [186]:
# creatin this for visualization purpose

Total_Bradford_2021_asian = data_bradford['Asian'].sum()
Total_Bradford_2021_black = data_bradford['Black'].sum()
Total_Bradford_2021_white = data_bradford['White'].sum()
Total_Bradford_2021_others = data_bradford['Others'].sum()
In [187]:
#Visualizing the range and distribution of numerical Variables
#Distribution of people in Bradford

fig, ax = plt.subplots(4, 2, figsize = (8, 6))

sns.boxplot(x= data_bradford["Asian"], ax = ax[0,0])
sns.distplot(data_bradford['Asian'], ax = ax[0,1])
sns.boxplot(x= data_bradford["Black"], ax = ax[1,0])
sns.distplot(data_bradford['Black'], ax = ax[1,1])
sns.boxplot(x= data_bradford["White"], ax = ax[2,0])
sns.distplot(data_bradford['White'], ax = ax[2,1])
sns.boxplot(x= data_bradford["Others"], ax = ax[3,0])
sns.distplot(data_bradford['Others'], ax = ax[3,1])
plt.tight_layout()
Insight: Boxplot and distribution plot provides visuals of outliers and skewness in the data. It can be seen that Asians, black and white are right skewed and black has bimodal distribution. Boxplot tells that there are many outliers in all the 4 variables.

2. 2011 census data - UK¶

Metadata :
census - https://www.nomisweb.co.uk/query/construct/submit.asp?menuopt=201&subcomp=


LDA : https://geoportal.statistics.gov.uk/datasets/ons::output-area-to-lower-layer-super-output-area-to-middle-layer-super-output-area-to-local-authority-district-december-2011-lookup-in-england-and-wales-1/explore
In [188]:
#Load the 2011 OA and LAD dataset

df_2011 = pd.read_csv("Dataset/Ethnic_census_data/2011_data/2011_1850896343189182.csv", sep=';', header=None, encoding='latin-1')
df_lad_2011 = pd.read_csv("Dataset/Ethnic_census_data/2011_data/Output_Area_to_Lower_layer_Super_Output_Area_to_Middle_layer_Super_Output_Area_to_Local_Authority_District_(December_2011)_Lookup_in_England_and_Wales.csv", encoding='latin-1')
In [189]:
#Using drop()function to delete first 5 rows in OA dataset

df_2011.drop(index=df_2011.index[:6], axis=0, inplace=True)
In [190]:
#Renaming the column name in OA dataset

df_2011.rename(columns={df_2011.columns[0]: 'Ethnic_Group'},inplace=True)
print(df_2011.columns)
Index(['Ethnic_Group'], dtype='object')
In [191]:
#Splitting the first column into multiple columns in OA data

df_2011 = df_2011.Ethnic_Group.str.split(',',expand=True)
In [192]:
# Set First Row as header in OA data

df_2011.columns = df_2011.iloc[0]
df_2011 = df_2011[1:]
In [193]:
# checking the column names in 2011 census OA data

df_2011.columns
Out[193]:
Index(['2011 output area', '"All usual residents"', '"White"',
       '"White: English/Welsh/Scottish/Northern Irish/British"',
       '"White: Irish"', '"White: Gypsy or Irish Traveller"',
       '"White: Other White"', '"Mixed/multiple ethnic groups"',
       '"Mixed/multiple ethnic groups: White and Black Caribbean"',
       '"Mixed/multiple ethnic groups: White and Black African"',
       '"Mixed/multiple ethnic groups: White and Asian"',
       '"Mixed/multiple ethnic groups: Other Mixed"', '"Asian/Asian British"',
       '"Asian/Asian British: Indian"', '"Asian/Asian British: Pakistani"',
       '"Asian/Asian British: Bangladeshi"', '"Asian/Asian British: Chinese"',
       '"Asian/Asian British: Other Asian"',
       '"Black/African/Caribbean/Black British"',
       '"Black/African/Caribbean/Black British: African"',
       '"Black/African/Caribbean/Black British: Caribbean"',
       '"Black/African/Caribbean/Black British: Other Black"',
       '"Other ethnic group"', '"Other ethnic group: Arab"',
       '"Other ethnic group: Any other ethnic group"'],
      dtype='object', name=6)
In [194]:
#Remove special character in headers

df_2011.columns=df_2011.columns.str.replace('["]','')
In [195]:
#Renaming the column names

df_2011.rename(columns={'White' : 'All_white',
                            '2011 output area': 'Output_Areas'}, inplace=True)
In [196]:
#changing the datatype for all the columns to numeric for further computation in OA data

df_2011['All_white'] = pd.to_numeric(df_2011['All_white'], errors='coerce')
df_2011['White: English/Welsh/Scottish/Northern Irish/British'] = pd.to_numeric(df_2011['White: English/Welsh/Scottish/Northern Irish/British'], errors='coerce')
df_2011['White: Irish'] = pd.to_numeric(df_2011['White: Irish'], errors='coerce')
df_2011['White: Gypsy or Irish Traveller'] = pd.to_numeric(df_2011['White: Gypsy or Irish Traveller'], errors='coerce')
df_2011['White: Other White'] = pd.to_numeric(df_2011['White: Other White'], errors='coerce')

df_2011['Mixed/multiple ethnic groups'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: White and Black Caribbean'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: White and Black Caribbean'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: White and Black African'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: White and Black African'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: White and Asian'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: White and Asian'], errors='coerce')
df_2011['Mixed/multiple ethnic groups: Other Mixed'] = pd.to_numeric(df_2011['Mixed/multiple ethnic groups: Other Mixed'], errors='coerce')

df_2011['Asian/Asian British'] = pd.to_numeric(df_2011['Asian/Asian British'], errors='coerce')
df_2011['Asian/Asian British: Indian'] = pd.to_numeric(df_2011['Asian/Asian British: Indian'], errors='coerce')
df_2011['Asian/Asian British: Pakistani'] = pd.to_numeric(df_2011['Asian/Asian British: Pakistani'], errors='coerce')
df_2011['Asian/Asian British: Bangladeshi'] = pd.to_numeric(df_2011['Asian/Asian British: Bangladeshi'], errors='coerce')
df_2011['Asian/Asian British: Chinese'] = pd.to_numeric(df_2011['Asian/Asian British: Chinese'], errors='coerce')
df_2011['Asian/Asian British: Other Asian'] = pd.to_numeric(df_2011['Asian/Asian British: Other Asian'], errors='coerce')

df_2011['Black/African/Caribbean/Black British'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British'], errors='coerce')
df_2011['Black/African/Caribbean/Black British: African'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British: African'], errors='coerce')
df_2011['Black/African/Caribbean/Black British: Caribbean'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British: Caribbean'], errors='coerce')
df_2011['Black/African/Caribbean/Black British: Other Black'] = pd.to_numeric(df_2011['Black/African/Caribbean/Black British: Other Black'], errors='coerce')

df_2011['Other ethnic group'] = pd.to_numeric(df_2011['Other ethnic group'], errors='coerce')
df_2011['Other ethnic group: Arab'] = pd.to_numeric(df_2011['Other ethnic group: Arab'], errors='coerce')
df_2011['Other ethnic group: Any other ethnic group'] = pd.to_numeric(df_2011['Other ethnic group: Any other ethnic group'], errors='coerce')
In [197]:
# Select required columns to sum using DataFrame.iloc[] 

df_2011['Asian']=df_2011.iloc[:,12:18].sum(axis=1)
df_2011['Black']=df_2011.iloc[:,18:22].sum(axis=1)
df_2011['White']=df_2011.iloc[:,2:7].sum(axis=1)
df_2011['Others']=df_2011.iloc[:,[7,8,9,10,11,22,23,24]].sum(axis=1)
In [198]:
# Dropping other features and including only required features

df_2011.drop(df_2011.iloc[:,1:-4], inplace=True, axis=1)
df_2011.head(1)
Out[198]:
6 Output_Areas Asian Black White Others
7 E00000001 12.0 0.0 350.0 26.0
In [199]:
df_lad_2011.head(1)
Out[199]:
OA11CD LSOA11CD LSOA11NM MSOA11CD MSOA11NM LAD11CD LAD11NM LAD11NMW ObjectId
0 E00000001 E01000001 City of London 001A E02000001 City of London 001 E09000001 City of London NaN 1
In [200]:
#Renaming LDA columns

df_lad_2011.rename(columns={"OA11CD": "Output_Areas", "LAD11NM" : "LAD", "LSOA11CD":"LSOA"}, inplace=True)

#Dropping the other columns

df_lad_to_join_2011 = df_lad_2011[['Output_Areas', 'LAD', "LSOA"]].copy() # To avoid the case where changing df1 also changes df
df_lad_to_join_2011.head(1)
Out[200]:
Output_Areas LAD LSOA
0 E00000001 City of London E01000001
In [201]:
# Join 2011 census and LAD dataframe

join_data_2011 = pd.merge(left = df_2011, right = df_lad_to_join_2011, how = 'left', on = 'Output_Areas' )
In [202]:
# Re-arranging the order of the column in joined dataset

join_data_2011 = join_data_2011.reindex(['LSOA','LAD','Output_Areas','Asian', 'Black', 'White','Others'], axis=1)
join_data_2011.head(1)
Out[202]:
LSOA LAD Output_Areas Asian Black White Others
0 E01000001 City of London E00000001 12.0 0.0 350.0 26.0
In [203]:
# counting unique values in join_data

n = len(pd.unique(join_data_2011["LSOA"]))
print("No.of.unique values :",n)

# counting unique values in join_data

n = len(pd.unique(join_data_2011["LAD"]))
print("No.of.unique values :",n)
No.of.unique values : 34754
No.of.unique values : 349
Insight: Unique values shows that there are 34754 LSOA and 349 Local Authourity Districts in the UK.
In [204]:
# converting 'Weight' from float to int
join_data_2011['Asian'] = join_data_2011['Asian'].astype(int)
join_data_2011['Black'] = join_data_2011['Black'].astype(int)
join_data_2011['White'] = join_data_2011['White'].astype(int)
join_data_2011['Others'] = join_data_2011['Others'].astype(int)
In [205]:
#Saving joined dataset 2011:

join_data_2011.to_csv("Dataset/Processed_Data/processed_ethnic_data_2011.csv")
In [206]:
#Loading the joined census dataset

df_census_2011 = pd.read_csv("Dataset/Processed_Data/processed_ethnic_data_2011.csv", index_col=0)
df_census_2011.head(2)
Out[206]:
LSOA LAD Output_Areas Asian Black White Others
0 E01000001 City of London E00000001 12 0 350 26
1 E01000001 City of London E00000003 52 6 412 30
In [207]:
df_census_2011.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 181412 entries, 0 to 181411
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   LSOA          181408 non-null  object
 1   LAD           181408 non-null  object
 2   Output_Areas  181408 non-null  object
 3   Asian         181412 non-null  int64 
 4   Black         181412 non-null  int64 
 5   White         181412 non-null  int64 
 6   Others        181412 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 11.1+ MB
Insight: Datatype of the varaibles is checked for computation purpose.
In [208]:
dfplot = df_census_2011.sort_values(by="Asian",ascending=False).head(20)

#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Asian')

#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='White')

#Adding legend and labels
plt.legend(loc='center right', prop={'size': 8})
plt.title('popluation by Ethnic Groups - Asians and White')
plt.xlabel('Poplulation - 2011')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)

#Displaying the plot
plt.show()
In [209]:
dfplot = df_census_2011.sort_values(by="Black",ascending=False).head(20)

#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Black')

#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='Others')

#Adding legend and labels
plt.legend(loc='center right', prop={'size': 8})
plt.title('popluation by Ethnic Groups - Black and others')
plt.xlabel('Poplulation - 2011')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)

#Displaying the plot
plt.show()
Insight: From the barplot which compares Asians and White, it can be seen Asians reside more in numbers in LAD than white. Wherelse, ethinic group comparison between Black and Others it can be seen that Others are more than black.
In [210]:
#Total popluation in the UK - # creatin this for visualization purpose

Total_UK_2011_asian = df_census_2011['Asian'].sum()
Total_UK_2011_black = df_census_2011['Black'].sum()
Total_UK_2011_white = df_census_2011['White'].sum()
Total_UK_2011_others = df_census_2011['Others'].sum()

2.1 2011 census data - Bradford¶

In [211]:
# Selecting only Bradford from the joined dataset

data_bradford_2011 = df_census_2011[(df_census_2011.LAD == "Bradford")]
In [212]:
# creating new variables for visualization purpose

Total_Bradford_2011_asian = data_bradford_2011['Asian'].sum()
Total_Bradford_2011_black = data_bradford_2011['Black'].sum()
Total_Bradford_2011_white = data_bradford_2011['White'].sum()
Total_Bradford_2011_others = data_bradford_2011['Others'].sum()

3. 2001 census data - UK¶

In [213]:
#Load the OA census 2001 and LAD dataset

df_2001 = pd.read_csv("Dataset/Ethnic_census_data/2001_data/2001_census_data.csv", sep=';', header=None, encoding='latin-1')
df_lad_2001 = pd.read_csv("Dataset/Ethnic_census_data/2001_data/Output_Area_to_Lower_Layer_Super_Output_Area_to_Middle_Layer_Super_Output_Area_(December_2001)_Lookup_in_England_and_Wales.csv", encoding='latin-1')
In [214]:
#Set first Row as Header in 2001 census data

df_2001.columns = df_2001.iloc[0]
df_2001 = df_2001[1:]
In [215]:
# Taking only first columns because it has all the values for the ethnic group and dropping other columns

df_2001.drop(df_2001.iloc[:,1:], inplace=True, axis=1)
df_2001.head(1)
Out[215]:
date,"geography","geography code","Ethnic Group: All categories: Ethnic group
1 2001,"00EHNC0001","00EHNC0001",261,255,255,0,0...
In [216]:
#Renaming the column name

df_2001.rename(columns={df_2001.columns[0]: 'Features'},inplace=True)
print(df_2001.columns)
Index(['Features'], dtype='object', name=0)
In [217]:
#Splitting the first column into multiple columns

df_2001 = df_2001.Features.str.split(',',expand=True)
In [218]:
df_2001.shape
Out[218]:
(175434, 25)
In [219]:
df_2001.columns = ['date',
                      'geography', 
                      'geography_code', 
                      'All_categories',
                      
                      'All_White',
                      'White_British',
                      'White_Irish',
                      'White_Other',
                      
                      'Mixed',
                      'Mixed_White_Black_Caribbean',
                      'Mixed_White_Black_African',
                      'Mixed_White_Asian',
                      'Mixed_Other',
                      
                      'Asian_Asian_British',
                      'Asian_Asian_British_Indian',
                      'Asian_Asian_British_Pakistani',
                      'Asian_Asian_British_Bangladeshi',
                      'Asian_Asian_British_Other',
                      
                      'Black_Black_British',
                      'Black_Black_Caribbean',
                      'Black_Black_African',
                      'Black_other',
                      
                      'Chinese_other',
                      'Chinese_other_Chinese',
                      'Chinese_other_other']
print(df_2001.columns)
Index(['date', 'geography', 'geography_code', 'All_categories', 'All_White',
       'White_British', 'White_Irish', 'White_Other', 'Mixed',
       'Mixed_White_Black_Caribbean', 'Mixed_White_Black_African',
       'Mixed_White_Asian', 'Mixed_Other', 'Asian_Asian_British',
       'Asian_Asian_British_Indian', 'Asian_Asian_British_Pakistani',
       'Asian_Asian_British_Bangladeshi', 'Asian_Asian_British_Other',
       'Black_Black_British', 'Black_Black_Caribbean', 'Black_Black_African',
       'Black_other', 'Chinese_other', 'Chinese_other_Chinese',
       'Chinese_other_other'],
      dtype='object')
In [220]:
#Remove special characters from columns in OA data

df_2001['geography'] = df_2001['geography'].str.replace('\W', '', regex=True)
df_2001['geography_code'] = df_2001['geography_code'].str.replace('\W', '', regex=True)
In [221]:
#Changing the data type for OA dataset

df_2001['All_White'] = pd.to_numeric(df_2001['All_White'], errors='coerce')
df_2001['White_British'] = pd.to_numeric(df_2001['White_British'], errors='coerce')
df_2001['White_Irish'] = pd.to_numeric(df_2001['White_Irish'], errors='coerce')
df_2001['White_Other'] = pd.to_numeric(df_2001['White_Other'], errors='coerce')

df_2001['Mixed'] = pd.to_numeric(df_2001['Mixed'], errors='coerce')
df_2001['Mixed_White_Black_Caribbean'] = pd.to_numeric(df_2001['Mixed_White_Black_Caribbean'], errors='coerce')
df_2001['Mixed_White_Black_African'] = pd.to_numeric(df_2001['Mixed_White_Black_African'], errors='coerce')
df_2001['Mixed_White_Asian'] = pd.to_numeric(df_2001['Mixed_White_Asian'], errors='coerce')
df_2001['Mixed_Other'] = pd.to_numeric(df_2001['Mixed_Other'], errors='coerce')

df_2001['Asian_Asian_British'] = pd.to_numeric(df_2001['Asian_Asian_British'], errors='coerce')
df_2001['Asian_Asian_British_Indian'] = pd.to_numeric(df_2001['Asian_Asian_British_Indian'], errors='coerce')
df_2001['Asian_Asian_British_Pakistani'] = pd.to_numeric(df_2001['Asian_Asian_British_Pakistani'], errors='coerce')
df_2001['Asian_Asian_British_Bangladeshi'] = pd.to_numeric(df_2001['Asian_Asian_British_Bangladeshi'], errors='coerce')
df_2001['Asian_Asian_British_Other'] = pd.to_numeric(df_2001['Asian_Asian_British_Other'], errors='coerce')

df_2001['Black_Black_British'] = pd.to_numeric(df_2001['Black_Black_British'], errors='coerce')
df_2001['Black_Black_Caribbean'] = pd.to_numeric(df_2001['Black_Black_Caribbean'], errors='coerce')
df_2001['Black_Black_African'] = pd.to_numeric(df_2001['Black_Black_African'], errors='coerce')
df_2001['Black_other'] = pd.to_numeric(df_2001['Black_other'], errors='coerce')

df_2001['Chinese_other'] = pd.to_numeric(df_2001['Chinese_other'], errors='coerce')
df_2001['Chinese_other_Chinese'] = pd.to_numeric(df_2001['Chinese_other_Chinese'], errors='coerce')
df_2001['Chinese_other_other'] = pd.to_numeric(df_2001['Chinese_other_other'], errors='coerce')
In [222]:
df_2001.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175434 entries, 1 to 175434
Data columns (total 25 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   date                             175434 non-null  object
 1   geography                        175434 non-null  object
 2   geography_code                   175434 non-null  object
 3   All_categories                   175434 non-null  object
 4   All_White                        175434 non-null  int64 
 5   White_British                    175434 non-null  int64 
 6   White_Irish                      175434 non-null  int64 
 7   White_Other                      175434 non-null  int64 
 8   Mixed                            175434 non-null  int64 
 9   Mixed_White_Black_Caribbean      175434 non-null  int64 
 10  Mixed_White_Black_African        175434 non-null  int64 
 11  Mixed_White_Asian                175434 non-null  int64 
 12  Mixed_Other                      175434 non-null  int64 
 13  Asian_Asian_British              175434 non-null  int64 
 14  Asian_Asian_British_Indian       175434 non-null  int64 
 15  Asian_Asian_British_Pakistani    175434 non-null  int64 
 16  Asian_Asian_British_Bangladeshi  175434 non-null  int64 
 17  Asian_Asian_British_Other        175434 non-null  int64 
 18  Black_Black_British              175434 non-null  int64 
 19  Black_Black_Caribbean            175434 non-null  int64 
 20  Black_Black_African              175434 non-null  int64 
 21  Black_other                      175434 non-null  int64 
 22  Chinese_other                    175434 non-null  int64 
 23  Chinese_other_Chinese            175434 non-null  int64 
 24  Chinese_other_other              175434 non-null  int64 
dtypes: int64(21), object(4)
memory usage: 33.5+ MB
In [223]:
# Select required columns to sum using DataFrame.iloc[] 

df_2001['Asian'] = df_2001.iloc[:,13:18].sum(axis=1)
df_2001['Black'] = df_2001.iloc[:,18:22].sum(axis=1)
df_2001['White'] = df_2001.iloc[:,4:8].sum(axis=1)
df_2001['Others'] = df_2001.iloc[:,[8,9,10,11,12,22,23,24]].sum(axis=1)
In [224]:
df_2001.head(1)
Out[224]:
date geography geography_code All_categories All_White White_British White_Irish White_Other Mixed Mixed_White_Black_Caribbean ... Black_Black_Caribbean Black_Black_African Black_other Chinese_other Chinese_other_Chinese Chinese_other_other Asian Black White Others
1 2001 00EHNC0001 00EHNC0001 261 255 255 0 0 3 0 ... 0 0 0 3 3 0 0 0 510 12

1 rows × 29 columns

In [225]:
#Dropping other features and including only required features

df_2001.drop(df_2001.iloc[:,3:-4], inplace=True, axis=1)
In [226]:
#Counting unique values in 2021 census data

n = len(pd.unique(df_2001["geography"]))
print("No.of.unique values :",n)

#Counting unique values in LAD data

n = len(pd.unique(df_lad_2001["OA01CD"]))
print("No.of.unique values :",n)
No.of.unique values : 175434
No.of.unique values : 175434
In [227]:
#Renaming the column name in LAD data

df_lad_2001.rename(columns={df_lad_2001.columns[0]: 'geography'},inplace=True)
In [228]:
#Dropping other features and including only required features in LAD data

df_lad_2001.drop(df_lad_2001.iloc[:,3:], inplace=True, axis=1)
In [229]:
#Remove the last 3 digits in location in LAD data

df_lad_2001['LSOA01NM'] = df_lad_2001['LSOA01NM'].str[:-5]

df_lad_2001.head(1)
Out[229]:
geography LSOA01CD LSOA01NM
0 00AAFA0001 E01000001 City of London
In [230]:
#Renaming the column name for LAD data for joining the data

df_lad_2001.rename(columns={'LSOA01CD':'LSOA','LSOA01NM':'LAD' },inplace=True)
In [231]:
#Checking if LAD and 2001 census data observations are same

df_lad_2001.shape
Out[231]:
(175434, 3)
In [232]:
# Join OA and LAD data

join_data_2001 = pd.merge(left = df_2001, right = df_lad_2001, how = 'left', on = 'geography' )
In [233]:
join_data_2001.head(1)
Out[233]:
date geography geography_code Asian Black White Others LSOA LAD
0 2001 00EHNC0001 00EHNC0001 0 0 510 12 E01012305 Darlington
In [234]:
#Dropping other features and including only required features in OA dataset

join_data_2001.drop(join_data_2001.iloc[:,[0,2]], inplace=True, axis=1)
join_data_2001.head(2)
Out[234]:
geography Asian Black White Others LSOA LAD
0 00EHNC0001 0 0 510 12 E01012305 Darlington
1 00EHNC0002 0 0 624 0 E01012305 Darlington
In [235]:
#Re-arranging the order of the column in joined data

census_2001 = join_data_2001.reindex(['LSOA', 'LAD', 'geography', 'Asian', 'Black', 'White', 'Others'], axis=1)
census_2001.head(2)
Out[235]:
LSOA LAD geography Asian Black White Others
0 E01012305 Darlington 00EHNC0001 0 0 510 12
1 E01012305 Darlington 00EHNC0002 0 0 624 0
In [236]:
#Saving joined dataset 2011:

census_2001.to_csv("Dataset/Processed_Data/processed_ethnic_data_2001.csv")
In [237]:
#Loading the joined census dataset

df_census_2001 = pd.read_csv("Dataset/Processed_Data/processed_ethnic_data_2001.csv", index_col=0)
df_census_2001.head(2)
Out[237]:
LSOA LAD geography Asian Black White Others
0 E01012305 Darlington 00EHNC0001 0 0 510 12
1 E01012305 Darlington 00EHNC0002 0 0 624 0
In [238]:
dfplot = df_census_2001.sort_values(by="Asian",ascending=False).head(20)

#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Asian')

#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='White')

#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation by Ethnic Groups - Asians and White')
plt.xlabel('Poplulation - 2001')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)

#Displaying the plot
plt.show()
In [239]:
dfplot = df_census_2001.sort_values(by="Black",ascending=False).head(20)

#Using a horizontal bar chat
#Ploting the popluation of Asian
plt.barh(dfplot.LAD, dfplot.Asian, label='Black')

#Ploting the popluation of Black
plt.barh(dfplot.LAD, dfplot.Black, label='Others')

#Adding legend and labels
plt.legend(loc='center right', prop={'size': 6})
plt.title('popluation by Ethnic Groups - Black and Others')
plt.xlabel('Poplulation - 2001')
plt.ylabel('UK')
plt.rcParams['figure.figsize'] = (6,4)

#Displaying the plot
plt.show()
Insight: From the barplot which compares Asians and White, it can be seen Asians reside more in numbers in LAD than white. Wherelse, ethinic group comparison between Black and Others it can be seen that Others are more than black.
In [240]:
#Total popluation in the UK - creatin this for visualization purpose

Total_UK_2001_asian = int(df_census_2001['Asian'].sum())
Total_UK_2001_black = int(df_census_2001['Black'].sum())
Total_UK_2001_white = int(df_census_2001['White'].sum())
Total_UK_2001_others = int(df_census_2001['Others'].sum())

3.1 2001 census data - Bradford¶

In [241]:
# Selecting only Bradford from the joined dataset

data_bradford_2001 = df_census_2001[(df_census_2001.LAD == "Bradford")]
In [242]:
# creatin this for visualization purpose

Total_Bradford_2001_asian = data_bradford_2001['Asian'].sum()
Total_Bradford_2001_black = data_bradford_2001['Black'].sum()
Total_Bradford_2001_white = data_bradford_2001['White'].sum()
Total_Bradford_2001_others = data_bradford_2001['Others'].sum()
In [243]:
# create an Empty DataFrame object

uk_population = pd.DataFrame(columns = ['year','Asian', 'Black', 'White', 'Others'])
print(uk_population)
Empty DataFrame
Columns: [year, Asian, Black, White, Others]
Index: []
In [244]:
# append rows to an empty DataFrame - uk_population
uk_population = uk_population.append({
                        'year' : '2021' ,
                        'Asian' : Total_UK_2021_asian, 
                        'Black' : Total_UK_2021_black, 
                        'White' : Total_UK_2021_white, 
                        'Others' : Total_UK_2021_others}, 
                ignore_index = True)

uk_population = uk_population.append({
                        'year' : '2011' ,
                        'Asian' : Total_UK_2011_asian, 
                        'Black' : Total_UK_2011_black, 
                        'White' : Total_UK_2011_white, 
                        'Others' : Total_UK_2011_others}, 
                ignore_index = True)

uk_population = uk_population.append({
                        'year' : '2001' ,
                        'Asian' : Total_UK_2001_asian, 
                        'Black' : Total_UK_2001_black, 
                        'White' : Total_UK_2001_white, 
                        'Others' : Total_UK_2001_others}, 
                ignore_index = True)
In [245]:
uk_population['Total'] = uk_population['Asian'] + uk_population ['Black'] + uk_population ['White'] + uk_population ['Others']
uk_population.head()
Out[245]:
year Asian Black White Others Total
0 2021 5515892 2408980 48699515 2974062 59598449
1 2011 8427062 3729780 96418790 3576192 112151824
2 2001 4547668 2279020 95042004 2214618 104083310